
fix(airc daemon): sentinel-marker for intentional re-exec on Windows (#203) #204

Merged
joelteply merged 1 commit into canary from fix/airc-daemon-windows-no-crashloop on Apr 28, 2026

@joelteply
Contributor

Closes #203. Caught by continuum-b69f on issue #196: the Windows daemon launcher crashloops because the .bat treats every airc re-exec (host-takeover, rejoin-as-joiner, race-loser) as a crash and respawns, racing the new airc.

Fix

  • airc writes `$AIRC_WRITE_DIR/airc.reexec-marker` (contents `<bashPID>:<unix-ts>`) immediately before each of the 5 `exec env ... "$0" connect ...` sites.
  • The daemon launcher .bat checks for a marker modified today via `forfiles /d 0`. If one is present, it logs a clear line and exits: the new airc from the exec is now the running daemon, so there is no respawn.
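A minimal sketch of the writer side, assuming only what is stated above: the marker lives at `$AIRC_WRITE_DIR/airc.reexec-marker` and holds `<bashPID>:<unix-ts>`. This is an illustrative re-implementation, not the helper from the diff; the temp-file-then-rename write and the best-effort `mkdir -p` are choices of this sketch.

```bash
#!/usr/bin/env bash
# Illustrative re-implementation of the marker writer (not the helper from
# the diff). Contents follow the PR description: "<bashPID>:<unix-ts>".
_write_reexec_marker() {
  # Without a scope dir there is nowhere to write; report failure to caller.
  [ -n "${AIRC_WRITE_DIR:-}" ] || return 1
  local marker="$AIRC_WRITE_DIR/airc.reexec-marker"
  local tmp="$marker.tmp.$$"
  # Best effort: early call paths may run before the scope dir exists.
  mkdir -p "$AIRC_WRITE_DIR" 2>/dev/null || true
  # Write a temp file, then rename it over the marker, so a reader never
  # sees a half-written line (rename is atomic on POSIX filesystems).
  printf '%s:%s\n' "$$" "$(date +%s)" > "$tmp" && mv -f "$tmp" "$marker"
}

# Call-site pattern (sketch): never let a failed marker write block the exec.
#   _write_reexec_marker || true
#   exec env ... "$0" connect ...
```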

Linux/Mac are unchanged: `exec` is a true `execve` there, so the parent never observes an exit and the launcher (launchd plist / systemd unit) never re-enters its loop.
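The PID-preserving behavior on Linux/macOS is easy to check: a bash that prints `$$` and then `exec`s a second bash, which prints its own `$$`, reports the same PID twice. A minimal sketch, assuming bash is on `PATH`; it is not meant to reproduce the MSYS behavior that the marker works around.

```bash
#!/usr/bin/env bash
# Prints the PID of a bash process, then execs a fresh bash that prints its
# own PID. With a real execve (Linux/macOS) both lines are identical.
pid_before_and_after() {
  bash -c 'echo "$$"; exec bash -c "echo \$\$"'
}

pid_before_and_after
```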

Mac/Linux/Win-Git-Bash CI install jobs validate the install path; the runtime daemon-launcher behavior can only be exercised on real Windows, so continuum-b69f will validate it in the #196 thread.

Out of scope: having the launcher transition to a monitor mode that polls airc.pid for liveness and restarts only when everything is dead. That is a useful future enhancement but not blocking: this PR's trade-off (after an intentional re-exec, no auto-restart for later real crashes) is strictly better than the existing crashloop.

Joel + continuum-b69f 2026-04-28: the Windows daemon launcher's `:loop`
respawned a fresh airc 5s after the original bash exited, racing the
new airc that had just taken over via host-mode re-exec. The result
was a continuous crashloop on `airc daemon install` from a project dir
whose room gist had a stale heartbeat (a common state on cold start).

Root cause, specific to MSYS bash on Windows: `exec env ... "$0" connect`
is a true execve on Linux/Mac (the PID stays and the parent never
observes an exit), but it is emulated as spawn-and-exit on Windows MSYS
(the parent bash exits and a new airc bash takes over with a different
PID). The daemon launcher's `bash -c "exec airc connect"` therefore
returns to the .bat after every host-takeover, which the .bat treats
as a crash.

Fix:
- New helper `_write_reexec_marker` writes
  `<bashPID>:<unix-ts>` to `$AIRC_WRITE_DIR/airc.reexec-marker`.
- Called immediately before all 5 `exec env ... "$0" connect ...`
  sites: 4 host-takeover paths (cmd_connect's stale-heartbeat self-
  heal in two different code paths × {rejoin-as-joiner, host}) + 1
  cold-host split-brain race-loser path.
- Daemon launcher .bat checks for the marker between iterations using
  `forfiles /p <scope> /m airc.reexec-marker /d 0` (file mtime today).
  If the marker is fresh, the launcher logs a "re-exec'd; new process
  is now daemon, launcher exiting" line and runs `exit /b 0` (no
  respawn). The new airc process from the exec is the running daemon
  now; a competing respawn would just kill it.
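The launcher logic can be modeled in bash to test it on any platform. This is a sketch under stated assumptions: `run_launcher` and `RESTART_DELAY` are hypothetical names, and unlike the .bat (which accepts any marker whose mtime is today) the model clears the marker before each launch, so only a marker written by the child that just exited counts.

```bash
#!/usr/bin/env bash
# Hypothetical bash model of the Windows launcher loop (the real one is .bat).
# Usage: run_launcher MARKER CMD [ARGS...]
# Runs CMD. If CMD exits and MARKER exists, treat the exit as an intentional
# re-exec and stop; otherwise treat it as a crash and restart CMD.
run_launcher() {
  local marker="$1"; shift
  while :; do
    # Clear leftovers so only a marker written by this run counts.
    rm -f "$marker"
    "$@"
    if [ -f "$marker" ]; then
      echo "airc re-exec'd; new process is now daemon, launcher exiting."
      rm -f "$marker"
      return 0
    fi
    echo "airc connect exited. Restarting in ${RESTART_DELAY:-5}s."
    sleep "${RESTART_DELAY:-5}"
  done
}
```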

On Linux/Mac the marker write is harmless: `exec` keeps the same PID,
the parent bash never observes an exit, and the launcher (where
applicable: launchd KeepAlive=true / systemd Restart=always) never
sees the marker because it never re-enters its monitor loop.

Trade-off: after an intentional re-exec, the .bat exits, so there is
no auto-restart for crashes that happen LATER in the new airc's
lifetime. The user must wait until the next logon or re-run
`airc daemon install`. This is acceptable compared with the current
behavior (a continuous crashloop after the first re-exec). Future
enhancement: the .bat could transition to a "monitor mode" that polls
airc.pid and only restarts if all PIDs in it are dead, but the simple
exit-on-marker is the minimal viable fix for #203.
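A rough sketch of the liveness predicate such a monitor mode would need. It assumes, without confirmation from this PR, that airc.pid lists one PID per line; `all_pids_dead`, `AIRC_PID_FILE` and `restart_airc` are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical liveness predicate for a future "monitor mode".
# Assumes (unconfirmed) that the pid file holds one PID per line.
# Succeeds only when no listed PID is alive.
all_pids_dead() {
  local pidfile="$1" pid
  [ -f "$pidfile" ] || return 0            # no pid file: nothing is running
  while IFS= read -r pid || [ -n "$pid" ]; do
    case "$pid" in ''|*[!0-9]*) continue ;; esac   # skip blank/garbage lines
    # kill -0 sends no signal; it only checks that the PID can be signaled.
    # It also fails for another user's process, acceptable for a per-user daemon.
    if kill -0 "$pid" 2>/dev/null; then
      return 1
    fi
  done < "$pidfile"
  return 0
}

# Monitor-mode sketch: restart only when every recorded process is gone.
#   while sleep 10; do
#     all_pids_dead "$AIRC_PID_FILE" && restart_airc
#   done
```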

Closes #203 once continuum-b69f re-tests on real Windows.

Copilot AI left a comment


Pull request overview

Adds a Windows-only “intentional re-exec” sentinel to prevent the Windows daemon launcher loop from treating airc self re-execs (host takeover / race loser / rejoin) as crashes and respawning into a crashloop.

Changes:

  • Introduces `_write_reexec_marker()` and writes a marker before each intentional `exec env ... "$0" connect ...` re-exec site.
  • Updates the Windows daemon launcher (`airc-daemon.bat`, generated by `airc daemon install`) to detect the marker and exit instead of respawning.


Comment thread airc
Comment on lines 5092 to 5095
```bat
cd /d "$cwd_win"
set AIRC_BACKGROUND_OK=1
:loop
"$bash_exe" -c "exec '$airc_bin_unix' connect"
```

Copilot AI Apr 28, 2026


The .bat’s marker path is based on $scope (derived from AIRC_HOME when set), but the launcher deliberately does not set AIRC_HOME and instead relies on cd /d "$cwd_win" + detect_scope(). If a user installs the daemon with AIRC_HOME set, airc connect will write the marker under the cwd-derived scope, while the launcher will look under $scope—so re-exec won’t be detected and the crashloop persists. Fix by exporting AIRC_HOME (as a Unix path) in the bash -c command and/or deriving cwd_win from $scope’s parent so detect_scope and $scope stay aligned.

Suggested change
```diff
+REM Export AIRC_HOME explicitly so the child bash resolves the same
+REM scope/marker path as this launcher, even when install-time AIRC_HOME
+REM differs from cwd-based detect_scope().
 cd /d "$cwd_win"
 set AIRC_BACKGROUND_OK=1
 :loop
-"$bash_exe" -c "exec '$airc_bin_unix' connect"
+"$bash_exe" -c "export AIRC_HOME='$scope'; exec '$airc_bin_unix' connect"
```

Copilot uses AI. Check for mistakes.
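To make the scope-mismatch point concrete: a child started through `bash -c` only resolves the same marker path as the launcher if both derive it from the same `AIRC_HOME`. This sketch assumes a hypothetical `marker_path` helper whose `$PWD/.airc` fallback merely stands in for `detect_scope()`; `/tmp/custom-scope` is a made-up install-time `AIRC_HOME`.

```bash
#!/usr/bin/env bash
# Hypothetical resolver: prefer AIRC_HOME, else a cwd-relative scope dir.
# The "$PWD/.airc" fallback is an assumption standing in for detect_scope().
marker_path() {
  printf '%s/airc.reexec-marker\n' "${AIRC_HOME:-$PWD/.airc}"
}
export -f marker_path

scope=/tmp/custom-scope        # stands in for an install-time AIRC_HOME
launcher_view=$(AIRC_HOME="$scope" marker_path)
# Child started the way the suggested .bat line does it: AIRC_HOME exported.
child_view=$(env -u AIRC_HOME bash -c "export AIRC_HOME='$scope'; cd /; marker_path")
# Child started without AIRC_HOME: falls back to its cwd and drifts.
drift_view=$(env -u AIRC_HOME bash -c 'cd /; marker_path')
echo "$launcher_view"
echo "$child_view"
echo "$drift_view"
```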
Comment thread airc
Comment on lines 5086 to 5111
```bat
@echo off
REM AIRC daemon launcher — generated by 'airc daemon install' on Windows.
REM Runs airc connect under bash, restarting on exit. Logs to daemon.log.
REM On intentional re-exec (host-takeover or rejoin-as-joiner), airc
REM writes airc.reexec-marker — we step aside rather than respawn,
REM since the new airc bash from the exec is now the daemon.
cd /d "$cwd_win"
set AIRC_BACKGROUND_OK=1
:loop
"$bash_exe" -c "exec '$airc_bin_unix' connect"
REM Did airc just intentionally re-exec? If marker exists and is recent,
REM the new airc process from the exec is now the running daemon —
REM exit the launcher loop instead of racing-respawn it.
REM forfiles /m airc.reexec-marker /d 0 /c "cmd /c exit 0" succeeds when
REM the file's mtime is today (fine-grained age check below via type +
REM date math is too brittle for .bat; "today" is our 60s proxy).
if exist "$marker_win" (
forfiles /p "$scope_win" /m airc.reexec-marker /d 0 /c "cmd /c exit 0" >nul 2>&1
if not errorlevel 1 (
echo [%date% %time%] airc re-exec'd into different mode ^(host-takeover or rejoin^); new process is now daemon, launcher exiting. >> daemon.err
del "$marker_win" >nul 2>&1
exit /b 0
)
)
echo [%date% %time%] airc connect exited. Restarting in 5s. >> daemon.err
timeout /t 5 /nobreak >nul
```

Copilot AI Apr 28, 2026


The generated .bat claims to log to daemon.log and cmd_daemon_status points users at $scope/daemon.log + $scope/daemon.err, but the batch file doesn’t redirect airc connect stdout/stderr anywhere, and the >> daemon.err writes are relative to the project cwd (not the .airc scope dir). This will make airc daemon log/status misleading and lose logs. Redirect the bash invocation to an explicit path under the scope (e.g. %scope%\daemon.log/%scope%\daemon.err), and write restart/reexec lines to that same errors file.

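One way the generator could emit the launch line with logs pinned to the scope dir, as a sketch: `emit_launch_line` is a hypothetical name, its parameters mirror the template variables `$bash_exe`, `$airc_bin_unix` and `$scope_win`, and the example paths are made up. The restart and re-exec `echo ... >>` lines would need the same absolute `daemon.err` path.

```bash
#!/usr/bin/env bash
# Hypothetical generator fragment for the .bat launch line. Parameter names
# mirror the quoted template ($bash_exe, $airc_bin_unix, $scope_win).
# stdout/stderr go to absolute paths under the scope dir, so they no longer
# depend on the launcher's cwd.
emit_launch_line() {
  local bash_exe="$1" airc_bin_unix="$2" scope_win="$3"
  printf '"%s" -c "exec '\''%s'\'' connect" >> "%s\\daemon.log" 2>> "%s\\daemon.err"\n' \
    "$bash_exe" "$airc_bin_unix" "$scope_win" "$scope_win"
}

# Example with made-up paths:
emit_launch_line 'C:\Program Files\Git\bin\bash.exe' /c/Users/me/bin/airc 'C:\proj\.airc'
```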
Comment thread airc
Comment on lines +5090 to +5106
```bat
REM writes airc.reexec-marker — we step aside rather than respawn,
REM since the new airc bash from the exec is now the daemon.
cd /d "$cwd_win"
set AIRC_BACKGROUND_OK=1
:loop
"$bash_exe" -c "exec '$airc_bin_unix' connect"
REM Did airc just intentionally re-exec? If marker exists and is recent,
REM the new airc process from the exec is now the running daemon —
REM exit the launcher loop instead of racing-respawn it.
REM forfiles /m airc.reexec-marker /d 0 /c "cmd /c exit 0" succeeds when
REM the file's mtime is today (fine-grained age check below via type +
REM date math is too brittle for .bat; "today" is our 60s proxy).
if exist "$marker_win" (
forfiles /p "$scope_win" /m airc.reexec-marker /d 0 /c "cmd /c exit 0" >nul 2>&1
if not errorlevel 1 (
echo [%date% %time%] airc re-exec'd into different mode ^(host-takeover or rejoin^); new process is now daemon, launcher exiting. >> daemon.err
del "$marker_win" >nul 2>&1
```

Copilot AI Apr 28, 2026


The “fresh marker” check uses forfiles ... /d 0, which matches any file modified today, not “~60s”. A stale marker left behind (e.g. from an interactive re-exec earlier the same day, or if del fails) could cause the launcher to exit on a real crash and stop auto-restarting. Consider clearing the marker before launching each loop iteration and/or parsing the marker’s embedded UNIX timestamp with PowerShell to enforce a small age window (e.g. <= 60–120s) before treating it as an intentional re-exec; also update the comment to reflect the actual semantics.

Suggested change
```diff
-REM writes airc.reexec-marker — we step aside rather than respawn,
-REM since the new airc bash from the exec is now the daemon.
+REM writes airc.reexec-marker containing a UNIX timestamp. We step aside
+REM rather than respawn only when that marker is freshly written by the
+REM exiting process, since the new airc bash from the exec is now daemon.
 cd /d "$cwd_win"
 set AIRC_BACKGROUND_OK=1
+set "AIRC_REEXEC_MARKER=$marker_win"
 :loop
+REM Clear any stale marker before launching. A leftover file from an
+REM earlier run must not suppress restart after a real crash.
+if exist "%AIRC_REEXEC_MARKER%" del "%AIRC_REEXEC_MARKER%" >nul 2>&1
 "$bash_exe" -c "exec '$airc_bin_unix' connect"
-REM Did airc just intentionally re-exec? If marker exists and is recent,
-REM the new airc process from the exec is now the running daemon —
-REM exit the launcher loop instead of racing-respawn it.
-REM forfiles /m airc.reexec-marker /d 0 /c "cmd /c exit 0" succeeds when
-REM the file's mtime is today (fine-grained age check below via type +
-REM date math is too brittle for .bat; "today" is our 60s proxy).
-if exist "$marker_win" (
-forfiles /p "$scope_win" /m airc.reexec-marker /d 0 /c "cmd /c exit 0" >nul 2>&1
+REM Did airc just intentionally re-exec? If the marker exists and its
+REM embedded UNIX timestamp is fresh (<=120s old), the new airc process
+REM from the exec is now the running daemon — exit this launcher loop
+REM instead of racing-respawn it.
+if exist "%AIRC_REEXEC_MARKER%" (
+powershell -NoProfile -Command "$ts = Get-Content -LiteralPath $env:AIRC_REEXEC_MARKER -TotalCount 1 -ErrorAction Stop; if ($ts -match '^\d+$') { $age = [DateTimeOffset]::UtcNow.ToUnixTimeSeconds() - [int64]$ts; if ($age -ge 0 -and $age -le 120) { exit 0 } }; exit 1" >nul 2>&1
 if not errorlevel 1 (
 echo [%date% %time%] airc re-exec'd into different mode ^(host-takeover or rejoin^); new process is now daemon, launcher exiting. >> daemon.err
-del "$marker_win" >nul 2>&1
+del "%AIRC_REEXEC_MARKER%" >nul 2>&1
```

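A bash sketch of a tighter freshness test, useful for reasoning about the age window. Because `_write_reexec_marker` writes `PID:UNIX_TIMESTAMP` rather than a bare number, the timestamp has to be split off after the colon before computing the age; a pattern that only accepts an all-digit line will not match this format. `marker_is_fresh` and its 120-second default are hypothetical, chosen to match the window proposed in this comment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if MARKER's first line ends in a UNIX
# timestamp (after the last colon, e.g. "<pid>:<unix-ts>") that is between
# 0 and MAX_AGE seconds old (default 120).
marker_is_fresh() {
  local marker="$1" max_age="${2:-120}" line ts now age
  [ -f "$marker" ] || return 1
  IFS= read -r line < "$marker" || [ -n "$line" ] || return 1
  ts="${line##*:}"                       # text after the last colon
  case "$ts" in ''|*[!0-9]*) return 1 ;; esac
  now=$(date +%s)
  age=$(( now - 10#$ts ))                # 10# avoids octal on leading zeros
  [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}
```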
Comment thread airc
```bash
# b69f's #203 crashloop). Marker contents: "PID:UNIX_TIMESTAMP". Caller
# is responsible for invoking this immediately before exec.
_write_reexec_marker() {
local marker="$AIRC_WRITE_DIR/airc.reexec-marker"
```

Copilot AI Apr 28, 2026


_write_reexec_marker writes to "$AIRC_WRITE_DIR/airc.reexec-marker" but doesn’t ensure $AIRC_WRITE_DIR exists. At least one call path (stale-host takeover fast-path) can hit _write_reexec_marker before init_identity creates the scope dir, so the marker write can silently fail and the Windows launcher will still treat the exit as a crash. Consider adding a cheap mkdir -p "$AIRC_WRITE_DIR" (best-effort) inside _write_reexec_marker so the sentinel is reliably created whenever needed.

Suggested change
```diff
 local marker="$AIRC_WRITE_DIR/airc.reexec-marker"
+mkdir -p "$AIRC_WRITE_DIR" 2>/dev/null || true
```

@joelteply
Contributor Author

Verdict: GREEN — merging.

Tested on real Windows MINGW64 (continuum-b69f, post-#202 canary). After `airc daemon install` from ~/continuum/:

  • daemon.err shows ZERO new restart entries (the last entry's timestamp was hours old, untouched after the new launcher fired)
  • Process tree is a stable parent+children (one airc connect root, no PID churn)
  • Marker file was consumed correctly by the `forfiles /d 0` check, and the .bat exited cleanly

The crashloop pattern you set out to kill — gone.

(The separate scope-corruption issue I hit on this dirty machine is unrelated to the sentinel-marker fix; it's accumulated state cruft from N install/uninstall cycles, and the green clean-install-windows CI shows that a fresh box doesn't reproduce it.)

Merging.

— continuum-b69f

@joelteply joelteply merged commit 8e9c66d into canary Apr 28, 2026
10 checks passed
@joelteply joelteply deleted the fix/airc-daemon-windows-no-crashloop branch April 28, 2026 14:10

2 participants